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Abstract 


•Systolic  arrays  are  construct^  for  bandwidth  reduction  and  singular  value  decomposition 
ofmxn  matrices,  wlog,  m  Jp  n.  The  underlying  algorithms  are  unconditionally  stable. 

Independently  of  the  bandwidth,  w,  or  the  order  of  the  matrix,  a  4fc2  processor  array  \ 
accomplishes  a  reduction  to  bidiagonal  form  in  time  0(wm2Jfc2);  subsequent  determina-  \ 
tion  of  a  singular  value  by  the  Golub- Re insch  SVD  iteration  takes  O(n)  steps.  With  O(m)  \ 
4w2  processor  arrays  the  reduction  time  becomes  sublinear,  resulting  in  0(m/w+an)  steps  \ 
to  compute  a  singular  values,  compared  to  sequential  times  of  0(m2+sn)  or  0(mtw*+an).  ) 

The  array  sizes,  in  contrast  to  many  other  designs,  do  not  depend  on  the  order,  m  or _ 

n,  of  the  matrix  rendering  it  possible  to  process  prnhlems  of  arbitrary  size‘s  Since  Input 
and  output  occurs,  as  in  previous  designS^[Help83,  KuLe78]|  by  diagonals  arrays  can  be 
directly  appended  to  further  reduce  the  computation  time.  Consequently,  the  designs  will 
be  most  efficient  for  matrices  with  a  fairly  small  and  dense  band. 


Introduction 


In  response  to  the  increasing  demands  for  computing  power  as  well  as  for  real  time  processing, 
which  exceed  the  capabilities  of  single  processor  systems,  the  implementation  of  systolic  arrays  in 
very  large  scale  integrated  (VLSI)  circuits  represents  a  cost  effective  way  to  exploit  parallelism. 

A  systolic  array  can  be  described  as  an  array  of  simple  processors  that  routes  data  in  a  regular 
fashion,  fully  exploits  parallel-pipelined  processing  and  makes  maximum  use  of  data  fetched  from 
memory.  The  basic  design  philosophy  is  stated  in  [KuLe78,  Kung8l],  according  to  which  numerous 
designs  for  triangular  decompositions,  linear  system  solution  and  eigenvalue  problems  have  been 
proposed  (some  of  those  are  surveyed  in  [Help83]).  The  present  paper  introduces  designs  for 
bandwidth  reduction  and  singular  value  computations  of  densely  banded  matrices. 

Since  systolic  arrays  rely  heavily  on  pipelining,  they  are  very  efficient  for  long  matrices.  Hence, 
the  number  of  processors  per  array  (including  I/O  ports)  should  not  depend  on  the  order  of  the 
matrix.  Aside  from  its  order,  tnxn,  another  topological  characteristic  of  a  matrix  is  the  bandwidth, 
w.  Frequently,  w  is  small  (w  =  3,5  for  PDEs),  so  inexpensive  arrays  of  w  or  w2  processors  can 
process  arbitarily  long  matrices  and  achieve  excellent  hardware  utilisation.  Furthermore,  input 
and  output  does  not  occur  by  rows  or  columns  but  rather  codiagonals  (subdiagonals,  diagonal, 
superdiagonals).  If  input  and  output  addressing  schemes  are  identical,  it  becomes  feasible  to 
chain  several  (possibly  different)  arrays  so  as  to  directly  pipeline  the  results  from  one  into  the 
other  obviating  the  need  for  intermittent  memory  transfers.  From  a  numerical  point  of  view,  the 
•underlying  algorithms  must  be  unconditionally  stable,  on  account  of  the  largely  data  independent 
control,  and  permit  easy  decomposition  of  the  problem  in  case  of  a  disagreement  between  hardware 
and  input  sizes. 

The  already  existing  designs  for  singular  value  computations  (SVD)  possess  a  hardware  com¬ 
plexity  proportional  to  the  order  of  the  matrix,  superlinear  time  complexity  and  little  capability 
for  handling  problems  that  are  larger  than  the  given  hardware.  An  0(rt2)  processor  singular  value 
array  ol  [FiLP82]  relies  on  a  version  of  Hestenes’  method  where  intermediate  quantities  are  not 
properly  updated;  a  formal  convergence  proof  is  not  provided.  [BrLu82]  construct  an  array  for  a 
one-sided  orthogonalisation  method  due  to  Hestenes  which  uses  either  a  linearly  connected  mesh  of 
0(n)  processors  and  0(mn)  steps  to  determine  a  singular  value  or  else  a  two-dimensional  O(mn) 
processor  array  with  a  non-planar  interconnection  structure  and  time  0(n  log  m).  The  method  is 
quadratically  convergent;  experience  suggests  that  6  to  10  iterations  per  singular  value  provide  for 
sufficient  numerical  accuracy.  In  (BrLV83a]  a  similar  architecture  of  0(n2)  processors  implements 
a  cyclic  Jacobi  method  for  the  SVD  in  0(m  +  n  logn)  steps.  With  the  addition  of  QR  and  matrix 
multiplication  arrays,  the  generalised  SVD  for  matrices  A  6  Rmxn  and  B  €  Rp*n  is  computed  in 
0(1  logn)  units  of  time  on  0(l2)  processors,  l  >  m,n,p.  With  reference  to  the  previous  works, 
Schreiber  [Schr83b]  suggests  a  method  to  cope  with  problems  that  do  not  match  the  array  size.  In 
[Schr83a]  he  proposes  a  Jfcn  processor  design  which  reduces  a  matrix  to  upper  triangular  form  with 
bandwidth  k+ 1  in  time  0(mn/k).  A  k(k+ 1)  processor  array  from  [Help83]  is  used  to  implement  a 
SVD  iteration  on  k  + 1  diagonal  matrices  (analogous  to  the  Golub-Reinsch  iteration  for  bidiagonal 
matrices)  in  time  6n  +  0(fc).  For  k  =  O  (^/n)  this  amounts  to  processor  and  hardware  requirements 
of  O  (ns/2) . 

In  contrast,  the  final  bandwidth  reduction  design  presented  here  consists  of  l  planar,  orthog¬ 
onally  connected  arrays  of  4k(eJc  +  1)  processors  each  with  an  array  traversal  time  of  2(n  +  4 fit) 
steps,  e  >  1.  An  iteration  of  the  subsequent  Golub-Reinsch  SVD  iteration  for  bidiagonal  matrices 
can  be  done  in  2n  steps.  On  the  average  3  iterations  are  required  to  find  a  singular  value  [Parl80]. 

The  time  to  find  a  singular  value  comes  to  0(wm2/lP  +an)  and  0(m2/w+an)  if  i  is  proportional 
to  w.  In  the  event  that  J  *  0(m)  arrays  are  available  the  time  is  further  reduced  to  0(m/w  +  an) 
as  opposed  to  the  sequential  process  with  0(m2w2  +  an)  steps.  The  bandwidth  reduction  array 
is  flexible  in  the  sense  that  it  can  handle  matrices  of  arbitrary  size  and  bandwidth.  The  data 
addressing  scheme  is  identical  to  the  one  employed  in  [Help83,  KuLe78);  consequently  a  variety  of 


arrays  can  be  chained  to  avoid  costly  memory  transfers  and  repeated  alignment  of  data.  The  basic 
operations  in  all  algorithms  are  generation  and  application  of  Givens’  plane  rotations,  resulting  in 
unconditional  stability  of  all  computations. 

The  following  is  an  overview  of  the  methods  to  be  implemented  : 

1.  Algorithm  BWO.  A2x(«-3)  array  removes  the  leading  element  of  one  subdiagonal  and  one 
superdiagonal.  If  1  arrays  are  chained,  a  reduction  to  bidiagonal  form  takes  0(n2w/l)  steps. 

2.  Algorithm  BWK.  A  4ix(ci+l)  array  removes  the  leading  (w-l-k)  elements  of  ft  subdiagonals 

or  ft  superdiagonals.  If  J  arrays  are  chained,  bidiagonal  form  is  achieved  in  0{nzv>/TJtZ)  steps. 

3.  Algorithm  SVG.  A  4  processor  matrix  transposition  array  directly  chained  to  a  4  processor 
matrix  multiplication  array  forms  the  product  A?  A  in  time  2n  +  4,  while  a  3  processor  linearly 
connected  mesh  accomplishes  one  (explicitly  shifted)  QR  iteration  in  4n  +  2  steps. 

4.  Algorithm  SVI.  A  5  processor  array  performs  one  (implicitly  shifted)  Golub-Reinsch  iteration 
for  bidiagonal  matrices  in  time  2n  +  3.  An  analysis  of  numerical  accuracy,  computation  time 
and  hardware  flexibility  reveals  a  combination  of  2  and  4  as  most  advantegeous. 

As  for  the  organisation  of  this  paper,  the  sequential  versions  of  the  algorithms  are  discussed 
in  the  above  order,  followed  by  a  description  of  their  systolic,  parallel  implementations.  One 
paragraph  on  matrix  transposition  arrays  is  included. 

The  following  assumptions  will  be  made  throughout  the  paper.  The  terms  ‘cell’  and  ‘proces¬ 
sor’  are  used  interchangeably.  Processors  perform  one  of  three  operations  :  generation  of  plane 
rotations,  application  (propagation)  of  plane  rotations  or  shifting  of  elements  (output  of  an  ele¬ 
ment  unaltered  to  the  left,  right  or  top  of  the  processor).  A  systolic  array  is  made  up  of  linearly 
connected  meshes  of  processors  (introduced  in  [Help83])  which  eliminate  at  most  one  subdiagonal 
or  one  superdiagonal.  Cells  of  a  linearly  connected  mesh  are  numbered  consecutively  from  left  to 
right,  starting  with  1.  Linear  meshes  within  an  array  are  numbered  from  bottom  to  top,  again 
starting  with  I.  Overbars,  e.g.,  tfr,  ft,  indicate  hardware  parameters. 

As  a  matter  of  consistency,  time  is  measured  with  regard  to  plane  rotations.  One  unit  of 
time  (or  time  step)  is  long  enough  to  aeommodate  either  the  generation  or  the  application  of 
a  plane  rotation.  For  a  detailed  account  on  the  execution  time  of  addition,  multiplication  and 
square  root  operations  involved  in  a  rotation  the  reader  is  referred  to  [SaKu78].  Implementation 
of  the  algorithms  is  not  affected  by  the  synchronicity  or  asynchronicity  of  the  arrays.  However, 
the  determination  of  computation  speed  is  based  on  synchronous  designs,  where  the  computation 
time  of  the  slowest  processor  is  taken  as  a  unit.  The  traversal  time  of  an  m  x  n  matrix,  m  >  n, 
through  an  array  of  k  linear  meshes  is 

T{m,i)  s  2(m  +  k)  -  l. 

In  most  of  the  figures,  only  the  indices  instead  of  the  matrix  elements  proper  will  be  displayed, 
since  it  is  the  position  within  the  array  that  matters  rather  than  the  value  of  an  element.  One  bit 
data  lines  are  omitted  in  illustrations,  their  value  is  obvious  from  the  direction  of  data  flow. 

Plano  Rotations 

The  singular  value  decomposition  of  a  matrix  A  6  Rm*n  can  be  represented  as 

A  =  ULVr, 

wnere  U  and  V  are  orthogonal  matrices  and  £  a  nonnegative  diagonal  matrix.  Its  diagonal  elements, 
the  eigenvalues  of  VATA,  are  called  the  singular  values  of  A. 

In  order  to  avoid  an  inordinate  number  of  arithmetic  operations  the  matrices  to  be  considered 
are  of  bidiagonal  form  (i  >  /  or  »  <  /  -  1  o,;-  a  0).  A  stable  reduction  to  this  simpler  form 

B  -  PAQt 
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by  orthogonal  equivalence  transformations  does  not  alter  the  singular  values  of  the  original  matrix 
A.  The  orthogonal  matrices  P  and  Q  are  products  of  Givens’  plane  rotations. 

The  standard  Givens’  rotation,  P,  is  a  computationally  stable  device  for  introducing  a  single 
tero  element  into  a  vector  or  matrix  [Parl80,  Wilk65|. 

p-(-.  :)• 

»  :::  »)-(»  %  :::  3)- 

where 

if  yi»0  then  »  i|,  c  =  1,  a  =  0, 
else  e  = 

x'i  ■  exi  +  ay,-,  y '•  =  -sx,  +  ey,-,  2  < 

A  detailed  treatment  of  plane  rotations  and  their  implementation  in  VLSI  is  found  in  [Help83, 
Ipse83).  Time  complexity  will  be  measured  in  terms  of  plane  rotations  generated  or  applied,  i.e., 
(PI)  or  (P2),  each,  is  assumed  to  count  as  one  time  step. 

All  of  the  algorithms  to  follow  (including  the  singular  value  computations)  rely  solely  on 
the  reduction  of  the  bandwidth  through  Givens’  Rotations.  Because  these  eliminations  proceed 
codiagonalwise  from  without  toward  the  diagonal  and  the  elimination  strategy  for  the  ensuing 
fill-in,  known  as  ‘chasing  the  bulge’  (Parl80),  is  similar  to  one  by  Rutishauser  [Ruti63,  Wilk6S] 
the  width  of  the  band  in  any  given  row  at  all  times  is  preserved.  Consequently,  the  amount  of 
hardware  for  the  systolic  networks  will  be  independent  of  the  matrix  dimensions. 

Sequential  Bandwidth  Reductions 

Two  bandwidth  reduction  methods,  chosen  to  avoid  temporary  increase  of  the  width  of  the 
band,  are  presented.  The  first  one  removes  the  outermost  sub-  and  superdiagonal  while  the  second 
one  eliminates  k  sub-  or  superdiagonals  at  the  same  time.  Assume  for  the  time  being  that  matrices 
are  square  of  order  n. 

Removal  of  the  outermost  codiagonal  pair,  i.e.,  the  qth  subdiagonal  and  the  pth  superdiagonal, 
is  accomplished  by  chasing  the  bulge  from  two  sides  as  shown  in  Algorithm  BWO  [Ruti63]. 


k>  l, 


V.  (PI) 
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*1 

i  <  k.  (P2) 


Repeat  until  the  matrix  A  is  of  order  1  : 

1.  Generate  and  apply  and  Pt,p+t  to  annihilate  elements  o,+iti  and  ai,p+i,  respectively, 

causing  fill-in  a9,w  and  ow,p. 

2.  Removal  of  the  leading  subdiagonal  element  in  step  1  entails  rotations  P(-+ij+,(w_1)i,(w_J)  in 
the  upper  and  P7+(,_1)(w_i)i,-(»_i)+i  in  the  lower  triangle  of  A,  i  >  1,  while  elimination  of  the 
leading  superdiagonal  element  in  step  1  entails  removal  of  fill-in  by  P«(»~i)+ij>+(,-i)(«,_i)  in 
the  lower  and  P,(w-i).(p+i)+«(»-i)  in  the  upper  triangle  of  A,  i  >  1. 

3.  Disregard  the  leading  row  and  column  in  all  subsequent  steps. 

Algorithm  BWO.  Removal  of  the  Outermost  Codiagonal  Pair  of  a  Matrix  A. 


The  sequence  of  steps  1  to  3  is  executed  O(n)  times  as  it  annihilates  one  superdiagonal  and 
one  subdiagonal  element  per  loop  traversal.  0(n2)  rotations  are  generated,  each  of  which  takes 
O(w)  steps  to  be  applied.  The  fill-in  created  and  removed  in  step  2  occurs  in  subdiagonal  (q  +  1) 
and  superdiagonal  (p  +  1).  After  each  pass  through  the  loop  the  matrix  order  is  considered  to 


3 


reduced  by  one.  If  for  simplicity  q  =  p  —  1  is  presumed,  the  bandwidth  of  the  matrix  is  reduced  by 
two  after  each  removal  of  a  codiagonal  pair  and  the  number  of  time  steps  amounts  to 


q  n-«+(i-») 

Tq  =  '£'(»-2U-1))  E  <»-«  +  » 

i=t  >=1 

“  5  E  (•  -  2(i  ~  l))(n  -  q  +  i)(»  -  q  + -  I) 

i= I 

=  +  1)  -  +  0(rw*). 

2  4 

For  differing  p  and  4,  the  computation  time  is  proportional  to  0(*m2  max{p,q});  the  dominant 
factor  may  double,  since  the  width  of  the  band  is  reduced  |  q  —  p  +  1  |  times  by  one  instead  of  two. 

The  second  method  is  based  on  a  different  order  of  elimination;  Jfc  >  1  codiagonals  are  removed 
by  pre(post)multiplication,  thereby  introducing  fill-in  which  is  annihilated  by  post(pre)multi- 
plication.  The  leading  rows  and  columns  now  containing  zeros  in  those  codiagonals  are  deflated. 
Then  the  process  is  repeated  on  the  remaining  matrix,  still  having  (9+  1)  elements  in  its  leading 
column  and  (p  +  1)  elements  in  its  leading  row.  The  fill-in  which  arises  while  eliminating  one  set 
of  codiagonals  is  entirely  removed  before  proceeding  with  the  elimination  of  a  further  set.  This  is 
illustrated  in  algorithm  BWK  where  one  loop  traversal  causes  the  annihilation  of  the  (w  -  k  -  1) 
leading  elements  of  each  of  the  k  outermost  subdiagonals. 


Repeat  until  the  matrix  A  is  of  order  1  : 

1.  Generate  Givens’  rotations  to  eliminate  by  premultiplication  the  k  outermost  subdiagonals 
thereby  causing  fill-in  of  k  additional  superdiagonals. 

2.  Generate  Givens’  rotations  to  remove  by  postmuitiplication  this  fill-in  and  thereby  restore  the 
previously  removed  subdiagonals  (now  containing  zeros  in  their  (w-Jfc- 1)  leading  positions). 

3.  Disregard  the  (w  —  k  —  1)  leading  rows  and  columns. 

Algorithm  BWK.  Removal  of  Jb  <  q  Subdiagonals  in  a  Matrix  A. 


Figure  1  gives  an  example  of  a  reduction  to  bidiagonal  form  of  a  matrix  with  q  =*  2  and  p  =  3;  in 
this  case  annihilation  of  subdiagonals  takes  place  before  annihilation  of  superdiagonals. 

As  will  be  shown  subsequently,  elimination  of  one  subdiagonal  implies  deflation  of  the  (w  —  2) 
leading  rows  and  columns  in  step  3.  When  starting  the  removal  of  the  next  subdiagonal,  the 
bandwidth  is  essentially  decreased  to  */  =  w  - 1.  As  before,  the  leading  ( w1  - 2)  rows  and  columns 
can  be  disregarded.  Thus,  annihilation  of  k  subdiagonals  requires  deflation  of  ( v  —  Jfc  - 1 )  rows  and 
columns  in  step  3  of  Algorithm  BWK. 
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(a)  Removal  of  q  =  2  Subdiagonals. 
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(b)  Subsequent  Removal  of  p  -  1  =  2  Superdiagonals. 
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Figure  1.  Reduction  to  Bidiagonal  Form  of  a  Square  Matrix,  n  =  11,  p  =  3,  7  =  2. 


Consider  the  annihilation  of  the  outermost,  qth,  subdiagonal  of  a  matrix  A  with  bandwidth 
»  =  p  +  7  + 1,  for  instance.  The  removal  of  its  «th  element  a9+ti,-,  1  <  i  <  n  -  (p  +  q),  is  effected  by 

p.. .  ( •••  a»+«'-i,r+?+i-i  0 

'  \  ae+M  •  *  •  aT+1>+7+l'-l  a9+i,p+4+i 

a»+»-i,i  •••  i,p+e+»— i  '  i+i-i.r+i+i  \ 

®  °«+«,p+e+»->  °e+»>+e+»  / 

The  fill-in  accumulates  in  superdiagonal  (p  +  1)  as  o*|f  liPln|t-  1  <  •  <  n  -  (p  +  7).  Elimination 
of  aq+jj,  where  n  -  (p  +  7)  +  1  <  »  <  n  -  q,  does  not  create  new  nonzero  elements.  Let  A  =  A'. 
Now  remove  element  a9+,-_iip+4+,-,  1  <  «  <  n  -  7  -  (p  +  7)  +  1,  of  this  new  superdiagonal  by 


(  a<r+i-I,p+q+i-l  ®f+f— 

I  a*+*>+*+»'-l  at+t\p+9+* 


1  af+S+S—l+»— liP+e-W— 1 

V  0  OT+H^-H'-l.r+^+i 

f  °rK-l*+f+i-l  '  0  ' 

ar+i,r>H+i-i  af+i,r+t+> 

1 

at+p+q-  ar+r+e-i-H-i,r+»+* 

V  at+r+q+<-i  a  ?+?+?+*- 
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which  at  the  same  time  restores  the  qth  subdiagonal  with  nonzero  elements  a?+p+^+,-_ i  iP+g+t-_ 1 , 
1  <  *  <  n  -  q  -  (p  +  q)  +  l.  Removal  of  a9+p+9+,-^ifP+1+i^i  for  n-q-(p  +  q)  +  2  <  t  <  n-q  does 
not  result  in  the  creation  of  further  nonzero  elements. 

The  first  nonzero  element  of  subdiagonal  q,  a?ii,  occupies  position  (q+p+q,p+q)-  The  deflated 
matrix  should  be  of  the  same  shape  as  the  original  one,  i.e.,  contain  a  dense  (q  +  1)  x  (p  +  1) 
leading  submatrix.  Moreover,  a?1i  should  now  be  in  row  q  +  1  and  column  1.  Consequently, 
(q  + p  + q)  -  (q  +  l)  —  w  -  2  rows  and  (p  +  q)-  1  —  w-2  columns  must  be  deflated. 

Removal  of  superdiagonals  takes  place  in  an  analogous  manner.  The  systolic  networks  to  be 
presented  require  the  set  of  k  codiagonals  to  consist  of  either  subdiagonals  or  superdiagonals,  but 
not  both.  Furthermore,  it  seems  to  be  inconsequential  whether  subdiagonals  are  deleted  first  or 
superdiagonals  [Ipse83]. 

The  removal  of  k  codiagonals  (subdiagonals  or  superdiagonals)  entails  [ 1  passes  through 
the  loop  of  Algorithm  6WK,  thus  a  reduction  to  bidiagonal  form  requires 

passes,  assuming  that  k  is  fixed.  To  obtain  the  total  number  of  arithmetic  operations  in  a  reduction 
to  bidiagonal  form,  observe  that  after  each  pass  the  order  of  the  matrix  is  reduced  by  (w  -  k  - 
1),  while  after  f— passes  the  bandwidth  is  decreased  by  k.  Hence  the  elimination  of  q 
subdiagonals  takes  time 


q/k  *»/(»-!-/*) 

Tk-^2(«>-U~l)k)  Y1  n-{i-\)(w-l-jk) 

-igi.-o-. 

q/k 

»<—«>  +  5",(‘ +»  E  • 


With 

the  last  term  becomes 


,  w - 1  w-1  q  t  w-1  , 


»/*  ,  ‘it  „ 

Ejr 


Thus, 


and 


3  1 

Ti  *  jn*q  +  -ng(w  -  q), 

r,  =  n!  +  in(«-g)  +  In*. 
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Consequently,  a  reduction  to  bidiagon&l  form  takes  0(n2w)  steps  for  k  =  1  and  0(nJ)  for  k  —  O(w). 

In  general,  reduction  to  bidiagonal  form  of  rectangular  matrices  A  G  Rm*n  requires  time  on 
the  order  of  v  (max{m,  n}),  where  v  ~  w  or  t;  =  w2  : 


m  >  n  Let  A  =  (AiAj)7",  where  A\  is  a  n  x  n  and  At  a  /  x  n  matrix,  and  /  =  m  -  n  >  0.  Since 
the  only  nonzero  elements  of  A2  constitute  subdiagonals,  their  removal  reduces  the  first 
dimension,  /,  of  A2  while  elimination  of  superdiagonals  does  not  affect  its  order.  Hence, 
0(vm:)  steps  are  needed,  resulting  in  a  n  x  n  matrix  for  the  singular  value  decomposition, 
m  <  n  Proceed  as  above  with  the  transpose,  AT,  rendering  a  computation  time  of  0(vn2)  as 
well  as  a  m  x  m  matrix. 


Sequential  Singular  Value  Computations 

Computation  of  singular  values  with  the  QR  algorithm  can  be  done  explicitly  and  implicitly. 
The  explicit,  straightforward,  approach  for  computing  the  singular  values  of  a  matrix  A  €  Rm*n 
consists  of  using  the  explicitly  shifted  QR-algorithm  to  find  the  eigenvalues  of  AT A,  as  illustrated 
in  Algorithm  SVE.  In  sequential  computations  the  origin  shift  s,-  is  an  eigenvalue  of  the  trailing 
principal  2x2  submatrix  of  Aj A,-. 


1.  Bi^AfAi  (Ao  =  A). 

2.  Compute  B,  -  a, 7  =  QiRi. 

3.  Determine  Bj+l  =  Qf  ftj  +  a,7. 

Algorithm  SVE.  One  Step  of  the  Explicitly  Shifted  QR-Algorithm  in  the  SVD  Decomposition  of  a 
Matrix  A. 


Orthogonal  equivalence  transformations  allow  both,  a  reduction  of  A  to  bidiagonal  form  before 
commencing  SVE  or  else  a  reduction  of  Bo  to  tridiagonal  form  after  step  1  of  SVE.  Let  A  G  Rm*n 
have  bandwidth  w,  l  =  min{m,n}  and  L  —  max{m,n}.  As  for  the  first  possibility,  the  bandwidth 
reduction  yields  a  bidiagonal  matrix  A'  G  Rfxl  in  time  0(w2L2).  Steps  1,2  and  3  deal  with  matrices 
of  bandwidth  2  or  3  in  time  0(1).  In  the  second  case,  matrix  multiplication  brings  about  an  n  x  n 
matrix  Ef  of  bandwidth  2w  —  1  in  time  0(w2L),  followed  by  a  reduction  to  tridiagonal  form  of  B1 
in  time  0(w2n2).  Steps  2  and  3  need  O(n)  operations.  Unless  /  =  n,  where  the  second  alternative 
requires  less  time,  the  first  one  is  to  be  preferred.  A  delayed  reduction  doubles  the  bandwidth, 
and  hence  the  hardware,  as  well  as  the  number  of  passes  through  the  loop  in  Algorithms  BWO 
and  BWK. 

Information  about  the  smaller  singular  values  of  A  may  be  lost  when  restricting  attention  to 
the  rounded  AT A.  The  following  method  avoids  the  explicit  formation  of  ArA. 

The  Golub- Reinsch  SVD  iteration  applies  a  variant  of  the  implicitly  shifted  QR  algorithm  to 
a  bidiagonal  matrix  A  G  RnXH  [GoRe7l]  (since  an  upper  bidiagonal  matrix  is  zero  below  the  nth 
row,  the  matrix  is  assumed  to  be  square). 

Find  Pi,  a  Givens’  rotation,  so  that 


( PiiAjAf  -  s,7))2l  =  0. 


Set 


Bi  =  AtPj 


and  compute  A;+i  by  reducing  Bi  to  bidiagonal  form  via  orthogonal  equivalence  transformations, 
requires  O(n)  arithmetic  operations.  The  algorithm  is  exhibited  in  detail  as  Algorithm  SVI  below. 


Aj"  Ai  —  QiRi, 
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then  Pj  has  the  same  first  column  as  Qi,  hence  B,-  differs  from  A,-  only  in  the  first  two  rows  and 
columns.  Accelerated  convergence,  associated  with  origin  shifts,  is  achieved  without  having  to 
subtract  the  shift  from  the  diagonal  and  later  restore  it  (as  in  the  explicitly  shifted  version)  but 
by  means  of  concentrating  the  effect  of  the  shift  a,-  in  the  matrix  P,-. 


1. 

2. 

3. 

4. 


5. 


Find  P{  so  that  (P,(Af/t;  -  a,/)) 21  =  0. 

Determine  B,-  =  A%Pj . 

=  B,\ 

For  j  =  2, 3, . . . ,  n  -  1  do 
begin 

4.1  Generate  a  premultiplication  Pjj-i  to  annihilate  the  element  of  By  in  position  (j,  j  -  1). 

4.2  Apply  Pjj—i  and  generate  a  fill-in  at  position  (/  —  l,j  +  1)  of  Pyy_iBy. 

4.3  Generate  a  postmultiplication  PjLl  ]+t  to  eliminate  the  element  in  position  (7  —  1  ,7  +  1) 
and  compute  By+j  =  ( Py j_i By ) PyC,  y+1 ,  while  causing  a  fill-in  at  position  (7  +  1./)  r* 


B 


end. 


j+i- 


A»+i  —  Bn. 


Algorithm  SVI.  One  Step  of  the  Golub-Reinsch  SVD  Iteration  for  Bidiagonal  Matrices  A. 


When  given  a  lower  bidiagona!  matrix  A  its  transpose  AT  will  be  considered,  as  the  singular  values 
of  A  and  AT  are  the  same.  For  a  diagonal  matrix  A  only  identity  rotations  are  formed. 

Roundoff  Error  in  Algorithms  BWO  and  BWK 

Not  only  the  time  complexity  but  also  a  roundoff  error  analysis  indicates  that  BWK  is  to  be 
preferred  over  BWO,  since  fewer  rotations  are  generated  and  hence  higher  accuracy  is  attained. 

Gentleman  [Gent75)  has  shown  that  the  application  of  plane  rotations  to  a  matrix  A  6  RnXn 
has  a  roundoff  error  bounded  above  by 

r?a  ||  A|F(l  +  f>)*-1, 

where  t\  is  a  multiple  of  the  machine  error  and  s  is  the  maximal  number  of  rotations  applied  to 
any  element.  The  following  observations  determine  a  for  BWK  and  BWO. 

BWK  :  In  step  1  of  BWK  each  row  is  affected  by  at  most  2k  different  premultiplications  (k  to 
eliminate  elements  in  that  very  row  and  k  for  the  row  beneath).  In  step  2  an  element 
of  a  row  is  affected  by  at  most  2k  postmultiplications,  giving  a  total  of  4k  rotations  per 
element  per  loop  traversal.  There  are  ^  traversals,  so 

4  kn 

3  bwk  = - : — r- 

w  —  1  -  k 

BWO  :  The  larger  the  bandwidth  the  smaller  the  percentage  of  elements  which  contribute  to 
the  rounding  error.  Hence,  the  worst  case  (generation  of  rotations  in  adjacent  rows  or 
columns)  occurs  in  BWO  when  reducing  a  matrix  with  q  =  1  and  p  =  2  to  bidiagonal 
form.  By  an  argument  similar  to  the  previous  one,  each  element  is  altered  by  at  most  4 
rotations.  To  eliminate  a  codiagonal  pair  n  -  1  loop  traversals  are  necessary  resulting  in 

*bwo  —  4  (n  -  1). 
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For  matters  of  comparison  assume  that  k  codiagonals  are  removed  :  BWO  annihilates  k/2  sub¬ 
diagonals  and  k/2  superdiagonals,  while  BWK  deletes  k  subdiagonals  or  k  superdiagonals.  BWO 
renders  higher  accuracy  than  BWK  only  if 


2k(n  -  1)  =  -3bwo  <  sbwk  — 


4krt 

w-l-k' 


which  is  the  case  for 

«  -  3 - -  <  k, 

n  -  I 

namely  a  reduction  to  bidiagonal  form, 

»Bwo  =  *bwk  ~  2(«o  -  2)(n  +  1), 
or  a  transformation  to  tridiagonal  form 

sbwo  =  aBWK  -  2 (to  -  3). 

The  above  comparison  presumes  BWK  to  remove  all  codiagonals  at  once  implying  that  A  has 
q  =  to  —  2orp  =  w-  l.  In  all  but  two  situations  BWK  is  executed  separately  for  subdiagonals 
and  superdiagonals  and  thus  more  accurate  than  BWO. 

Processors 

The  processors  employed  are  simple  extensions  of  cells  1,  T,  2,  2’  and  3  in  [Help83,  Ipse83] 
with  the  same  data  flow.  Processor  Bl  generates  and/or  applies  rotations  while  B2  only  applies 
rotations.  Cell  B  is  a  conglomerate  of  two  Bl  and  two  B2  cells,  whereas  S  is  needed  to  direct 
(shift)  data  to  appropriate  processors.  In  addition,  each  processor  is  augmented  with  two  one-bit 
data  lines  which  determine  the  direction  of  dataflow  elemnts 

left  ss  1  shifting  of  matrix  elements  to  the  left  for  subdiagonal  elimination  (premultiplications). 
right  =  1  shifting  of  matrix  elements  to  the  right  for  superdiagonal  elimination  (postmultiplica¬ 
tions).  Processors  Bl  and  B  are  additionally  attached  to  a  third  bit  line,  on.  The  three 
bit  lines  determine  the  task  performed  by  a  processor. 

In  particular,  cell  Bl  generates  rotations  if  on  =  1  (behaving  like  cell  1  or  1’)  and  applies 
them  otherwise  (cell  2  or  2’).  If  left  =  1  then  the  rotations  generated  are  premultiplications  (as  in 
cell  1,  or  2)  and  postmultiplications  if  right  «  1  (cell  1’,  or  2’),  see  Figure  2.  Cell  B2  propagates 
rotations  to  the  right  and  matrix  elements  to  the  left  if  left  =  1  (like  cell  2)  and  if  right  =  1 
rotations  to  the  left  and  elements  to  the  right,  shown  in  Figure  3.  Figure  4  depicts  cell  S  which 
directs  matrix  elements  straight  upwards  with  a  delay  of  one  step  if  on  =  0  (i.e.,  it  functions  as 
cell  3  with  identity  rotations  (Help83j).  If  on  =  1,  it  displaces  them  by  one  cell  to  the  left  if 
left  =  1  (similar  to  cell  2  with  identity  rotations  [Help83])  and  to  the  right  if  right  =  1  (cell  2’ 
with  identity  rotation).  Figures  5,  6  and  7  depict  I/O  format  and  data  flow  through  linear  meshes 
for  subdiagonal  elimination  (left  =  1),  superdiagonal  elimination  (right  —  1)  and  shifting  upwards 
(left  =  right  =  0).  The  former  are  identical  to  the  ones  in  [Help83j. 

Cell  B  is  crucial  tor  the  correct  execution  of  algorithms  BWO  and  SVI,  since  it  assures  consis¬ 
tent  application  of  rotations  as  follows.  The  reduction  of  a  matrix  A  by  ^re-  and  postmultiplications 
results  in  a  matrix  B,  where 


and  P<j  and  Q,  j  are  Givens’  rotations.  Parallelism  is  exploited  by  taking  advantage  of  associativity 
in  matrix  multiplication  :  it  is  immaterial  whether  premultiplications  or  postmultiplications  are 
performed  first ,  however 
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1.  the  order  of  premultiplications  in  fij)  &nd  postmultiplications  in  ^Il»j  Q>\f)  must  be 
preserved, 

2.  if  Pij  and  Qk,t  are  applied  at  the  same  time,  they  must  not  affect  common  elements;  when 
acting  on  the  same  elements  (at  different  times),  one  must  be  completed  before  the  other 
commences. 

In  order  to  ascertain  condition  2,  B  handles  application  of  a  rotation  to  three  different  element 
pairs  at  the  same  time  and  thus  enables  premultiplications  and  postmultiplications  to  take  place  in 
succeeding  time  intervals;  so  fill-in  is  generated  in  one  before  being  removed  in  the  next  step.  It  joins 
two  linear  meshes  and  forwards  postmultiplications  from  top  to  bottom  an^  premultiplications  from 
bottom  to  top.  The  signals  oupre  and  onpos  determine  whether  rotations  are  to  be  generated 
and  if  so,  of  which  type  they  are. 

For  ease  of  understanding  (not  necessarily  conforming  to  physical  reality)  the  data  flow  ‘inside’ 
cell  B  will  be  depicted  as  similar  to  one  in  a  combination  of  cells  B1  and  B2.  Figure  8  presents 
the  programme  of  cell  B  along  with  an  illustration  of  the  element  pairs  that  are  affected.  Diagonal 
elements  are  input  to  the  lower  left  comer  and  elements  of  the  first  superdiagonal  to  the  lower 
right.  The  neighbouring  processor  to  the  lower  left  delivers  elements  of  the  first  subdiagonal  and 
the  one  to  the  lower  right  (if  present)  elements  of  the  second  superdiagonal.  Five  time  steps  are 
necessary  for  a  particular  value  to  ‘traverse’  the  processor.  Because  a  diagonal  element  is  the  first 
element  of  a  matrix  to  enter  the  array  under  the  current  I/O  format,  a  postmultiplication  (POS) 
takes  place  before  a  premultiplication  (PRE),  either  of  which  may  be  identity. 
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Zoai 


{Internal  registers  are  P,  x  and  y} 
rights  :=  rightin ; 

5—  Ic/tm! 

»  :=  Vm; 

If  on  —  1  then 

If  It f  Un  -  1  then 

begin  {Generate  premultiplications} 
x zr;  {Input} 

(S)  :=  {Equations  (PI)} 

P,  P ;  zi  0;  {Output} 

end 

else  {rightin  =  1} 

begin  {Generate  postmultiplications} 
x :*»  *<;  {Input} 

(x  0)  :=  (x  y)PT;  {Equations  (PI)} 
Pi  :«  P;  Zr  0;  {Output} 

end 

else  {on  =  0} 

If  Uftin  =  1  then 

begin  {Propagate  premultiplications} 

P  :=  Pi;  x  :=  z,;  {Input} 

(J)  “  pO'<  {Equations  (P2)} 

Pr  :=  P;  ZJ  :«  y;  {Output} 

end 

else  {righUn  =  1} 

begin  {Propagate  postmultiplications} 

P  :=  Pr;  x  :=  zi;  {Input} 

(x  y)  :=  (x  y)PT ;  {Equations  (P2)) 
P  :=  P;  2,  :=  y;  {Output} 

end 

*0*1  *• 


Figure  2.  Cell  Bl. 


xout 


«r 

rights 

lefto* 

Pr 


y.n 

{Internal  registers  are  P,  x  and  y} 

right :=  rights; 

Icftont  'a  ^/^inl 

y  :=  vin ; 

if  It  ftin  -  1  then 

begin  {Propagate  premultiplications} 

P  :=  Pl;x  zT;  {Input} 

(*)  “  /’(*);  {Equations  (P2)} 

Pr  :=  P;  zi  y;  {Output} 

end 

else  {rsyAt,-„  =  1} 

begin  {Propagate  postmultiplications} 

P  :=  pr;  i ;»  zt;  {Input} 

(x  y)  :=  (x  y)Pr;  {Equations  (P2)} 
Pi  :=  P;  Zt  :=  y;  {Output} 

end 

*<»«  :=  *• 


Figure  3.  Cell  B2. 


{Internal  registers  are  x  and  g) 
right  at  :=  right  in\ 

It  flout  it  f tin 

Uleftin  « 1  than 

begin  {Shift  left) 

torn t  :=  n  :*  Vin\ 

•nd 

•1m  If  rigktin  =  1  than 

begin  {Shift  right} 

Zoai  :a  Zi\ iT  t* 

•nd 

aba  {right iH  -  Ufa*  *  0} 
begin  {Shift  upwards} 

*<mt  :■*;*  :*»  * 

•nd 


Figure  4-  Cell  S. 
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(a)  Input/Output  Format. 

Figure  6.  Linearly  Connected  Mesh  for  Superdiagonal  Elimination  (left  m  0,  right  =  1,  #  =  5) 
and  Input  p  —  q  =*  2. 
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i  e 


onpRE 


2  out 

-LJ= 


*i 

*2  yi 


onpos 
■*  !fo«( 

-  Pin 


Pont  J  y$ 


■rj  y2 


T“ 

y«n 


T 


{Internal  registers  are  P,  x,-  and  y,-,  1  < »  <  3  } 
*s ;!=  *«n!  ys  ■—  y«»; 

if  onpos  =  1  then  {Generate  rotations) 

(xi  0)  :=  (x,  yi)Pr; 
else  {Input  rotations) 

P  :=  Pin ; 

/*i  *3  **\  13  *i\p7\ 

Vfi  S3  »a'  ‘  'in  »  ifa-'  ’ 

atont  '•=  *»;  Vont  :*  yi;  Pont  :■  P;  {Output) 

(2  ID  :=  (2  *!).  {Transpose) 


(a)  Postmultiplieation  (POS). 
Figure  8.  Cell  B. 
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Xo*t 


onpRE 


onpos 


xin 

Pout 


{Internal  registers  are  P,  2,-  and  y,-,  1  <  »  <  4} 
**  ■-  *.•«;  Vi  •'=  «/»•»; 

If  onpRE  —  1  then  {Generate  rotations} 

(*<>’)  =-  *>(£); 
else  {Input  rotations) 

P  :=  ft.; 

/Xj  X.  X«\  p/*2  *•  **\. 

vr.  y>  n'  ‘  'yj  y.  y«'’ 

*<*.«  :=  *2;  S/0.1  :=  s/2;  Pout  :=  P;  {Output} 

(»i  £)  :=  (£  *)•  {Transpose  and  rename} 


(b)  Premultiplication  (PRE). 
Figure  8.  Continuation. 


A  Systolic  Array  for  Algorithm  BWO 

The  concurrent  order  for  the  generation  of  plane  rotations  is  described  in  [SaKu78,  Help83j. 
Consider  subdiagonal  annihilation  :  after  the  ith  element  of  the  q  —  j  +  1st  subdiagonal  has  been 
removed,  rows  1  •  • .  q  +  J  +  •  —  2  will  not  be  affected  by  further  eliminations  in  this  subdiagonal 
and  removal  of  the  ith  element  of  subdiagonal  q  -  j  can  commence,  l<j<k,  l<i<n  —  q  — 
i  +  1.  Annihilation  of  subdiagonal  elements  in  corresponding  positions  is  staggered  so  as  to  avoid 
interference  of  rotations.  The  analogue  holds  for  superdiagonal  eliminations. 

In  compliance  with  Algorithm  BWO  the  array  removes  the  leading  elements  of  the  outermost 
sub-  and  superdiagonal  along  with  the  ensuing  fill-in.  It  consists  of  two  linear  meshes,  joined 
through  cell  B  (which  in  view  of  area  and  complexity  accounts  for  four  cells),  bringing  the  total 
number  of  processors  to  2w  -  3.  The  w  -  1  processors  in  the  bottom  mesh  are  B2  cells.  The  ones 
to  the  left  of  cell  B  have  right  ■  1  and  to  the  right  left  =  1.  When  q  >  2  and  p  >  3  the  two 
leftmost  and  rightmost  cells  in  the  upper  mesh  are  Bl  cells,  all  others  B2  and  oripre  —  oripott  =  0 
for  cell  B  (rotations  are  generated  by  the  Bl  processors  not  B).  Cells  to  the  left  of  B  have  left=  1 
and  to  the  right  of  B  right  a  1.  The  inner  Bl  cells  remove  the  leading  nonzero  element  of  an 
outermost  codiagonal  (subdiagonal  q  or  superdiagonal  p)  by  generating  an  appropriate  rotation 
(on  =*  1).  For  all  subsequent  elements  they  forward  rotations  (on  a  0),  generated  by  the  outer 
Bl  cells,  to  chase  the  bulge  from  subdiagonal  q  +  1  or  superdiagonal  p  +  1.  Figure  9  presents  a 
bandwidth  reduction  array  with  I/O  format  for  $  =  4  and  »  =  8  while  Figure  10  displays  the  data 
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flow  for  the  example  ?  =  2  and  £>  =  6.  The  cases  q  <  2  or  p  <  3  are  treated  with  an  array  similar 
to  the  one  used  for  the  singular  value  decomposition,  where  B  generates  rotations  and  some  of  the 
B1  cells  are  dispensable. 
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Figure  9.  Bandwidth  Reduction  Array  for  *  =  8,  9  —  4,  p  =  5. 
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A  traversal  through  the  array  is  done  in 

T(n,2)  =  2n  +  3 

steps.  If  the  leading  row  and  column  is  deflated  after  each  pass  through  the  array,  the  number  of 
steps  for  removal  of  the  qrth  subdiagonal  and  pth  superdiagonal  comes  to 


n— f  n— r 

£T(n-i  +  l,2)  =  £2(n-.'  +  3)-l 
i=i  i=i 

=  (n  -  r)(2n  +  5)  -  (n  -  r)(n  -  r  +  1) 

=  n2  -  r2  +  O(n), 

where  r  —  min{p,g}.  As  for  the  generation  of  rotations,  the  left  inner  Bl  cell  has  on  =  1  at 
t  -  1 . . .  q  +  2  and  on  =*  0  thereafter,  when  on  —  1  for  the  outer  Bl  cell.  Similarly,  the  right  outer 
Bl  cell  is  set  to  on  =  1  at  /  =  1 . .  .p  +  2,  after  which  on  becomes  0  and  the  outer  Bl  processor 
generates  rotations. 

If  k  <  n  -  r  such  arrays  are  chained  and  for  simplicity  the  k  leading  rows  and  columns  are 
deflated  after  each  pass  through  the  k  arrays,  the  removal  of  a  codiagonal  pair  takes  time 


{n-r)/k  [n—r)/k 

£  T(n  -  («  -  l)k,  2k)  =  2(«  “  +  3t)  -  1 

=  2%>  +  <^+0(n)  =  0(£). 

The  setting  of  bit  lines  now  also  depends  on  the  position  of  the  array  in  the  k  array  network. 

Only  if  the  number  of  arrays,  k,  is  proportional  to  the  order  of  the  matrix,  is  the  resulting 

computation  time  linear  in  n.  A  reduction  to  bidiagonai  form  takes  O  steps.  For  small  k 

this  represents  a  reduction  in  execution  time  of  order  w  as  compared  to  the  sequential  case. 

The  last  (reduction  from  w  =  3  or  4  to  w  =*  2)  in  a  stage  of  bandwidth  reduction  arrays  is 
nearly  identical  to  the  array  implementing  Algorithm  SVI.  The  resulting  agreement  in  hardware 
features  and  execution  speeds  of  both  arrays  greatly  facilitates  their  chaining.  Unfortunately, 
within  a  bandwidth  array  the  complexity  of  different  cell  types  varies  greatly  and  so  do  their 
execution  times.  If  t,  is  the  time  for  cell  x  to  complete  one  step,  then 


2s  ^  *B- 


As  opposed  to  the  SVI  array  which  basically  consists  of  one  processor  that  generates  rotations  in 
each  step,  2w  —  4  simpler  processors  work  at  the  speed  of  one  big,  complex  cell.  Moreover,  on 
account  of  the  large  area  occupied  by  cells  Bl  and  B,  the  ratio  of  bandwidth  to  area  becomes  only 
appreciable  for  large  band  widths.  Lastly,  on  account  of  the  fixed  positions  for  those  cells  not  only 
w  =*  «  is  necessary  but  also  p  —  p  and  <7  =  <J.  Thus,  max{p,q}  differently  sized  arrays  must  be 
available  in  order  to  accomodate  the  decreasing  bandwidths. 


Linear  Meshes  for  Algorithm  BWK 

The  design  for  algorithm  BWK  requires  simpler  hardware,  already  introduced  in  [He!p83], 
than  that  for  BWO  and  allows  processing  of  arbitrarily  sized  matrices.  Two  types  of  linear  meshes, 
each  comprising  9>  cells,  are  employed;  one  performs  actual  computations,  i.e.,  elimination  of 
codiagonals,  and  a  second  one  properly  positions  (shifts)  matrix  elements. 
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A  computation  mesh  consists  of  (i>  -  2)  B1  and/or  B2  ceils,  enclosed  by  a  Bl  cell  to  the  right 
and  to  the  left.  Each  mesh  contains  at  most  one  Bl  cell  with  on  =  1.  The  computation  mesh,  an 
extension  of  the  one  in  (Help83]  and  obtained  by  replacing  cells  1  and  1’  by  Bl  and  2  and  2’  by 
B2,  can  perform  both,  pre-  and  postmultiplications;  which  of  the  two,  is  determined  by  the  value 
of  the  bits  left  and  right,  and  on  in  the  Bl  cells. 

If  left  —  l  a  mesh  performs  premultiplications  (Figure  5)  and  is  called  a  QR  mesh  (9  such 
meshes  combined  effect  a  QR  decomposition).  If  the  matrix  elements  are  routed  so  that  the 
outermost  subdiagonal  enters  a  Bl  cell  with  on  *  1,  this  subdiagonal  is  eliminated,  the  resulting 
rotation  propagated  to  the  left  and  thus  a  new  superdiagonal  (fill-in)  generated.  Because  the 
bandwidth  is  preserved,  the  remaining  elements  are  displaced  one  cell  to  the  left  before  output 
from  the  mesh.  Identity  rotations  are  generated  for  a  zero  outermost  subdiagonal.  A  Bl  cell  with 
on  s  1  must  be  among  the  «  -  (w  -  1)  leftmost  celts  of  the  mesh  to  provide  for  enough  space  to 
the  right  for  the  other  —  1  codiagonals.  If  all  Bl  cells  have  on  =  0,  nonidentity  rotations  may 
be  input  to  the  leftmost  Bl  cell,  which  applies  and  then  forwards  them  as  before. 

If  right  =  1,  the  mesh  performs  postmultiplications  (Figure  6)  and  is  called  QL  mesh  (p  of 
those  accomplish  a  QL  factorisation).  The  outermost  superdiagonal  is  annihilated  when  entering  a 
Bl  cell  with  on  =  l,the  resulting  rotations  forwarded  to  the  left  and  another  subdiagonal  is  formed. 
Again,  preservation  of  bandwidth  dictates  that  the  remaining  elements  be  displaced  one  cell  to  the 
right  before  leaving  the  mesh.  A  zero  superdiagonal  causes  formation  of  identity  rotations.  To 
accomodate  the  (€>  -  1)  codiagonals  to  the  left  of  it,  a  Bl  cell  with  on  =  1  must  be  among  the 
rightmost  €>  —  (w  —  1)  cells  of  the  mesh. 

The  cases  left  =  right  —  1  or  left  =  right  =  0  are  not  allowed  to  occur.  As  a  matrix  elements 
is  displaced  to  either  the  left  or  right  before  exiting  the  array,  it  needs  three  time  steps  to  traverse 
the  mesh. 

A  linear  shift  mesh  consists  of  S  cells  (Figure  7).  If  left  =  1  matrix  elements  are  shifted  one 
cell  to  the  left  (as  in  a  QR  mesh  with  identity  rotations),  or  one  to  the  right  for  right  =  1  (cf. 
the  QL  mesh  with  identity  rotations).  Here  too,  the  situation  left  *  right  —  1  is  not  permitted. 
However,  left  =  right  ~  0  causes  shifting  of  elements  straight  upwards  in  that  particular  mesh. 
Because  of  the  delay  element  in  S  cells,  the  traversal  time  is  the  same  as  for  a  linear  mesh,  thereby 
ascertaining  that  both  mesh  types  work  in  a  synchronised  manner. 

A  Systolic  Array  for  Algorithm  BWK 

An  array  (‘module’)  is  constructed  which  executes  one  pass  through  the  loop  of  BWK;  during 
one  module  traversal  the  ( w-k-l )  leading  elements  of  the  £  outermost  subdiagonals  are  removed. 
For  the  removal  of  1  <  k  <  q  subdiagonals,  [£/£]  passes  through  the  module  are  necessary.  Hence, 
£  <  £  may  be  assumed  for  the  subsequent  analysis;  moreover,  as  will  be  explained  later,  9  >  »  +  £ 
is  required  (otherwise  the  matrix  must  be  partitioned,  for  according  strategies  see  [Ipse83]). 

A  module  consists  of  four  sets  of  linear  meshes.  The  £  QR  meshes  at  the  bottom  of  the  module, 
where  input  is  entered,  are  followed  by  a  group  of  shift  meshes  that  direct  matrix  elements  to  the 
proper  positions  for  input  to  the  £  QL  meshes.  These  in  turn  are  succeeded  by  another  group  of 
shift  meshes  which,  upon  output,  will  have  placed  matrix  elements  in  the  positions  assumed  at 
input  to  the  module.  A  sketch  of  the  module  is  found  in  Figure  11. 

The  QR  part  causes  removal  of  k  subdiagonals  along  with  fill-in  of  k  superdiagonals.  The 
QL  part  erases  this  fill-in  and  restores  the  k  previously  eliminated  subdiagonals  except  for  their 
(w  -  k  -  1)  leading  nonzero  elements.  Furthermore,  the  outer  codiagonals  are  eliminated  by  the 
lower  meshes  while  the  inner  ones  are  removed  by  the  upper  meshes.  More  formally,  mesh  j  of 
the  QR  part  removes  the  (9  -  /  +  l)st  subdiagonal,  £  -  £  +  1  <  /  <  £,  by  appending  (£  -  k)  zero 
subdiagonals  to  the  left  of  the  band  (hence,  for  k  *  0,  «  >  w  +  £  is  required).  Then  subdiagonal 
(9  -  k)  will  emerge  from  the  leftmost  cell  of  the  top  mesh  in  the  QR  part.  Similarly,  superdiagonal 
(p  +  £  —  j  +  1)  is  removed  in  mesh  j  of  the  QL  part,  £-£  +  1  <  j  <  £,  and  superdiagonal  p 
appears  in  the  rightmost  Bl  cell.  In  case  the  number  of  subdiagonals  to  be  removed  is  less  than 
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the  number  of  meshes,  Jb  <  Jb,  the  lower  (k  -  k)  meshes  of  the  QR  and  QL  parts  do  not  generate 
rotations  (on  =  0  in  all  their  Bl  cells).  Tables  I  and  2  (with  w*®)  show  processor  time  schedules 
for  the  QR  and  QL  parts. 

On  account  of  the  shift  groups,  input  and  output  patterns  of  a  module  are  identical,  so  chaining 
of  modules  is  feasible.  It  remains  to  be  determined  how  many  meshes  are  to  be  present  in  a  shift 
group.  Assuming,  that  data  alignment  is  left  justified  in  the  QR  and  right  justified  in  the  QL  part, 
Tables  1  and  2,  with  w  =  ft,  show  the  diagonal,  for  instance,  to  leave  cell  (Jb  —  k)  +  (g  +  1)  —  t  of 
mesh  Jb  in  the  QR  part.  For  proper  superdiagonal  elimination  it  has  to  enter  cell  (ft  -  (Jb  -  Jb)  -  p) 
in  mesh  1  of  the  QL  group.  Since  ft  >  tv,  the  first  shift  group  has  to  direct  matrix  elements  to  the 
right.  Analogously,  cell  (ft  +  Jb — p)  in  the  top  mesh  of  the  QL  part  delivers  the  diagonal,  which  has 
to  return  to  its  original  position  in  order  to  enter  cell  (k  -  k)  +  (9+  1)  in  the  QR  part.  Hence,  the 
second  group  performs  shifts  to  the  right.  In  both  cases,  data  have  to  be  displaced  by  the  same 
number  of  places,  (ft  -  tv)  -  (k  —  Jb). 

For  matching  hardware  and  input  dimensions,  k  =  Jb  and  ft  =  w  +  k,  the  displacement  comes 
to  Jb,  the  difference  between  tv  and  ft.  However,  for  ft  >  to  +  £  the  difference  between  tv  and  ft 
can  be  arbitrary  and  so  the  amount  of  displacement.  To  avoid  data  dependent  control  or  hardware 
dimensions,  both  shift  groups  are  modified  as  follows  :  a  Bl  cell  occupies  position  t’Jb  +  1,  t  >  0,  in 
every  mesh.  To  avoid  idle  cells,  ft  =  cJb+ 1  for  some  l  >  1  bringing  the  number  of  Bl  cells  per  mesh 
to  i  + 1.  Instead  of  guiding  elements  toward  the  right  boundary,  the  second  shiftgroup  now  directs 
elements  toward  the  nearest  possible  Bl  cell  which  is  never  more  than  k  cells  away.  On  account  of 
the  initial  left  justified  data  alignment,  the  first  shift  group  still  shifts  toward  cell  1.  Consequently, 
there  are  k  meshes  in  each  shift  part  and  each  element  is  shifted  by  ((ft  -  tv)  -  (Jb  -  Jb))  mod  (Jb+ 1). 
The  preceeding  ideas  are  applied  to  an  example  in  Figure  12.  Processor  time  schedules  for  all  four 
groups  are  given  in  Tables  1  to  4,  w  —  ((ft  -  tv)  -  (Jb  -  Jb))  mod  (Jb  +  1)  in  Table  2.  Hence,  the 
number  of  steps  for  the  traversal  of  a  4Jb  x  (cJb  +  1)  processor  module  amounts  to 

r(n,  4Jb)  =  2(r»  +  4k)  -  1 . 

For  the  sake  of  simpler  control  deflation  may  be  omitted,  since  only  identity  rotations  are 
generated  for  leading  zero  elements.  Otherwise  the  leading  (w  -  Jb  -  1)  rows  and  columns  are 
removed  after  pass  through  the  module. 
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Figure  12.  Schematised  Module  for  Subdiagonal  Elimination,  k 
q  =  k  =  3,  p  =  2  and  w  —  6. 
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Assume, 

1.  1  <  k  <  q,  k  <  k  and  w  +  t  < 

2.  element  1  of  the  diagonal  enters  at  time  t  =  1; 

3.  Uj  s»  (2i  -  1)  +  2(y  -  1); 

4.  x  —  k  —  k. 

Table  1.  Processor  Time  Schedule  for  Meshes  in  the  QR  Group  of  a  Module  for  Subdiagonal 
Elimination. 
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Assume, 

1.  I  <  k  <q,  k  <Je  and  w  +  k  < 

2.  element  1  of  the  diagonal  enters  at  time  t  =  1; 

3.  »  (2.  -1) +  2(>-l); 

4.  *  =  *  -  (k  -  k)  -  k; 

5.  ii  =  »  or  ((»  -  w)  -  (t  -  fc))  mod  Jr  +  1. 

Table  2.  Processor  Time  Schedule  for  Meshes  in  the  QL  Group  of  a  Module  for  Subdiagonal 
Elimination. 
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Assume, 

1.  \  <  k  <q,  k  <i  and  v  +  i  <  €>; 

2.  element  1  of  the  diagonal  enters  at  time  t  =  1; 

3.  Uj  «(2i-I)  +  2(/-l); 

4.  meshes  k  -  z  +  1, ....  4  shift  to  the  right,  all  others  upwards,  0  <  z  <  k. 

Table  3.  Processor  Time  Schedule  for  Meshes  in  the  First  Shift  Group  of  a  Module  for  Subdiagonal 
Elimination. 
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Assume, 

1.  \<k<q,  k<k  and  w  +  h  <  *; 

2.  element  1  of  the  diagonal  enters  at  time  t  =  1; 

3.  ty-  (2.  -1)  +  20  -1); 

4.  meshes  k  -  z  +  1, . . . ,  £  shift  to  the  right,  all  others  upwards,  0  <  z  <  i. 


Table  4.  Processor  Time  Schedule  for  Meshes  in  the  Second  Shift  Group  of  a  Module  for  Subdiag 
onal  Elimination. 


Programming  of  a  Module 

Given  a  module  for  subdiagonal  elimination  with  parameters  l  and  k,  the  following  require¬ 
ments  must  be  satisfied  :  _ 

®  =  cJ  +  l,  JE>1,  e  >  1. 

Meshes  1 . . .  Jfc,  2t  +  I . .  .3 k  :  cell  ik  +  1  is  a  B1  cell,  all  others  are  B2  cells,  0  <  «  <  f. 

Meshes  k  +  1  ...2k, 3k  +  1...4JE  :  all  cells  are  S  cells. 

The  initial  (default)  state  of  the  module  is  a  follows  : 

1.  Bit  Data  Lines 

Meshes  1 . . .  k  (QR  group)  :  left  =  1,  right  =  0. 

Meshes  k  +  1 . .  .2k  (1.  shift  group)  :  left  =  right  =  0. 

Meshes  2Jfc  +  1 . .  .34  (QL  group)  :  left  ~  0,  right  =  1. 

Meshes  3k  +  1 . . .  4k  (2.  shift  group)  :  left  =  right  =  0. 

Meshes  1 . . .  k,  2k  +  1 . . .  34  :  cells  ik  +1  (Bl  cells),  1  <  i  <  ?,  have  on  =  0. 

2.  Matrix  and  Rotation  Lines 

Mesh  1  :  all  matrix  input  lines  are  0. 

Meshes  1 . . .  k  :  the  left  rotation  line  of  cell  1  contains  the  identity  rotation. 

Meshes  2k  +  1 . .  .3k  :  the  right  rotation  line  of  cell  ii  contains  the  identity  rotation. 

Meshes  1 . . .  k,  3k  +  l . . .  4k  :  the  right  matrix  input  to  cell  is  0. 

Meshes  k+  l...  3k  :  the  left  matrix  input  to  cell  1  is  0. 

If  the  input  matrix  satisfies  *  <  w  +  k,  it  is  partitioned  into  submatrices  of  size  ft  x  n',  where 
n'  =  [(«-i+l)/2j;  for  details  see  (lpse83).  Thus,  assume  w  >  w+k.  The  programme  in  Figure  13 
describes  elimination  of  subdiagonals.  If  superdiagonals  are  to  be  annihilated,  the  functions  of  For 
superdiagonal  elimination  the  groups  QR  and  QL  are  exchanged,  as  well  as  the  shift  groups  1  and 
2  by  accordingly  modifying  the  bitlines  left,  right  and  on. 
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A  Systolic  Array  for  Algorithm  SVE 

One  iteration  of  SVG  for  bidiagonal  matrices  is  implemented  by  taking  advantage  of  simple 
arrays  for  matrix  transposition  [Ipse83],  to  be  discussed  in  the  next  section,  and  QR  decomposition 
[Help83,  Ipse83],  Formation  of  transposes  'on  the  fly’  enhances  the  degree  of  pipelining  and 
avoids  time  consuming  storage  of  data  during  a  computation  for  the  purpose  of  changes  in  address 
patterns. 

The  initial  formation  of  the  bulge  (step  1,  i  =  0)  is  done  with  the  help  of  a  four  processor 
matrix  transposition  array  (Figure  14)  and  a  four  processor  matrix  multiplication  array  [KuLe78] 
in  2n  -  1  and  3n  +  2  steps,  respectively.  Steps  2  and  3  employ  a  linear  three  processor  QR 
decomposition  array  (Figure  15)  with  one  B1  and  two  B2  cells.  Its  inputs  are  (B,  -  s,-I)  and  Rf, 
respectively,  and  the  execution  time  is  2(n  -  1).  Cell  B1  generates  rotations  in  step  2  and  forwards 
the  recycled  rotations  in  step  3.  Formation  of  the  transpose  Rj  before  step  3  and  Bj  after  step  3 
is  accomplished  with  fifteen  and  twelve  processor  networks  (Figure  16)  in  2(n  +  6)  and  2(n  +  4) 
steps,  respectively.  Because  of  hardware  requirements  for  large  bandwidths,  to,  as  well  as  demands 
on  accuracy,  the  implicitly  shifted  method  is  recommended. 
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(a)  Upper  Triangular  Matrices. 

Figure  16.  Matrix  Transposition  Arrays  for  Matrices  of  Bandwidth  w  =  3. 
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(b)  Tridiagonal  Matrices. 
Figure  16.  Continuation. 
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Systolic  Arrays  for  Matrix  Transposition 

The  transpose  of  a  matrix,  in  particular  a  banded  matrix,  can  be  obtained  by  reflecting  the 
codiagonais  about  the  diagonal,  whereby  the  ordering  between  successive  elements  of  a  codiagonal 
is  preserved.  That  »,  the  ith  subdiagonal  becomes  the  ith  superdiagonal,  1  <  i  <  q,  and  the 
jth  superdiagonal  turns  into  the  jth  subdiagonal,  1  <  j  <  p.  The  square  array  of  OUv2)  delay 
elements  has  a  low  hardware  cost  for  matrices  of  small  bandwidth,  in  contrast  to  0(rr)  cells  for 
dense  matrices  (the  case  w  ■  2  is  an  exception,  as  only  2  not  4  cells  are  needed). 
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Four  kinds  of  delay  elements,  shown  in  Figure  17,  direct  the  matrix  elements.  A  pair  of  cells, 
(L,  R),  acts  as  a  2  x  2  crossbar  switch  if  both,  L  and  R,  receive  their  inputs  at  the  same  time.  As 
illustrated  in  Figure  18  for  the  case  w  —  6,  the  L  and  R  cells  are  arranged  in  a  chequered  fashion  as 
a  square  orthogonally  connected  w"1  processor  array.  At  the  left  and  right  edges  (I/O  takes  place 
at  the  bottom  and  top  edges)  where  they  do  not  pair,  a  S  cell  (as  before,  but  left  =  right  =  0, 
Figure  17)  is  used  instead.  For  the  (L,  R)  pair  to  properly  function,  data  must  be  delivered  and 
output  by  non-time  skewed  codiagonals.  It  takes  two  time  steps  to  traverse  a  mesh  and  values  may 
are  input  no  sooner  than  every  other  cycle.  This  implies  that  the  ith  element  of  each  codiagonal 
enters  the  the  jth  linear  mesh  at  time 

T(i,j)  -  1  «  2(i  +  j  -  1)  -  1,  1  <  i  <  n,  1  <  j  <  w, 

bringing  the  total  traversal  time  to 

T(n,w)  =  2(n  +  «)  -  1. 

Absent  values  at  the  end  of  the  ‘shorter’  codigonals  are  represented  by  a  default  value  of  zero.  The 
array  size  depends  only  on  w  not  on  the  particular  values  of  p  and  q. 

In  case  of  a  disagreement  between  matrix  and  hardware  dimensions  let  the  array  be  of  di¬ 
mension  x  i>.  If  the  bandwidth  of  the  matrix  is  too  small,  v  <  ®,  then  prior  to  input  -  w 
codiagonals  are  arbitrarily  attached  to  the  outside  of  the  band  and  removed  on  output.  For  the 
opposite,  *  <  w,  there  is  only  a  simple  solution  if  a  €>  x  k  array  is  available  and  k  >  w  is  a  multiple 
of  The  transpose  is  available  after  f/f/€>]  passes  through  the  array. 

A  non-time  skewed  codiagonal  address  pattern  calls  for  a  pre-  and  postprocessor  to  ensure 
compatibility  with  the  codiagonal  wise  I/O  pattern  of  all  the  other  arrays  (Figures  14  and  16). 
Let  k  be  the  larger  of  p  and  q.  Consisting  of  0  cells  the  preprocessor  has  the  ith  codiagoual 
traverse  k-i  cells  upwards  (A:  for  the  diagonal)  so  corresponding  elements  of  all  codiagonals  enter 
the  transposition  array  at  the  same  time.  Similarly,  the  ith  codiagonal  crosses  t  S  cells  in  the 
postprocessor  (none  for  the  diagonal)  to  re-establish  the  time  skewed  codiagonals.  The  number  of 
cells  in  the  two  adjustment  arrays  comes  to 

fe (*“  •')  +  £ (*-«)  +  *)  +  fe*  +  £»)  = 

V,=i  «=j  /  \;=i  <=i  / 

and  including  the  transposition  array  this  makes  v(w  +  k)  cells.  If  the  first  element,  ati,  enters 
the  preprocessor  at  time  t  =  1  and  ann  is  the  last  element  to  leave  the  postprocessor,  having  to 
traverse  k  +  w  cells,  then  a  transposition  is  completed  in  time 

r(n,  Jfc  +  w)  =  2(n  +  k  +  w)  -  1  <  2(n  +  2to). 
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Figure  17.  Processors  for  Matrix  Transposition. 
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A  Systolic  Array  for  Algorithm  SVI 

The  processor  array  which  implements  SVI  (except  for  step  1)  is  a  special  case  of  the  bandwidth 
reduction  array  (q  =  1  ,p  =  2)  and  consists  of  three  linearly  connected  meshes,  a  total  of  5 
processors,  shown  in  Figure  19(a).  The  hardware  count  is  independent  of  n.  The  bottom  mesh, 
consisting  of  B2  cells  with  right  =  1,  is  responsible  for  the  formation  of 

—  A{Pj , 

that  is  the  initial  creation  of  the  bulge  (step  2  in  SVI).  P{,  computed  in  a  scalar  unit  (step  1  of 
SVI),  and  the  first  element  of  A,-,  an,  enter  the  network  at  the  same  time.  Alternatively,  since  B, 
differs  from  A,-  only  in  the  leading  two  rows  and  columns,  it  also  could  be  determined  by  a  scalar 
unit  and  the  bottom  mesh  omitted  (Figure  19(b)).  This  layer  alters  matrix  elements  during  the 
first  three  time  steps,  all  rotations  following  Pf  are  identities. 

Reduction  of  B,  (step  4  of  SVI)  is  done  entirely  by  cell  B  which  in  turn  performs  computations 
POS.2.1  and  PRE.2.1,  caused  by  on^  =  onpe,t  =  1.  POS.2.2  and  PRE.2.2  are  never  encountered 
during  this  iteration  and  can  thus  be  disregarded.  Another  B2  cell  in  the  second  mesh  delivers 
elements  of  the  bulge  to  cell  B.  The  total  time  to  determine  A,+i  =  B„  comes  to 

T(n,  2)  —  2 n  +  3, 

and  excluding  the  computation  of  B,  to  2n  +  1.  The  network,  I/O  format  and  execution  trace 
are  displayed  in  Figures  19  and  20.  After  each  singular  value  comp  station,  the  trailing  row  and 
column  of  At-  are  disregarded. 

The  plain  (unshifted)  algorithm  is  linearly  convergent.  If  Wilkinson’s  Shift  [Wi!k68]  is  used, 
the  algorithm  is  ultimately  quadraticaliy  convergent  and  in  practice  seems  to  be  cubically  con¬ 
vergent;  three  iterations  on  the  average  are  enough  to  compute  a  singular  value  [Parl80].  Since 
the  shift  is  determined  from  the  trailing  end  of  the  matrix  and  applied  to  the  leading  part,  it 
obstructs  the  pipelining  process  and  necessitates  transfers  to  memory  between  iterations.  Under 
these  circumstances,  the  average  number  of  (arithmetic)  steps  for  one  singular  value  amounts  to 

3T(n,  2)  =  6 n  +  5. 

Research  is  under  way  to  find  means  of  accelerating  convergence  (obviating  the  need  for  intermittent 
memory  transfers)  and  hence  a  computation  time  of 

T(n,2k)  —  2(n  +  2k)  -  1  for  some  t  >  1. 


(b)  Reduction  of  U,  to  Bidiagonal  Form  A,+t . 
Figure  19.  Continuation. 
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